Compositions and systems for identifying and comparing expressed genes (mRNAs) in eukaryotic organisms

ABSTRACT

The invention comprises compositions and systems to identify and compare expressed genes in a given in vivo or in vitro RNA sample, as well as the relative difference in mRNA expression between two or more samples, where desired. Furthermore, the invention comprises compositions and systems to identify novel genes. The invention comprises, without limitation, one or more mRNA specific identimers for use in reverse transcription that themselves comprise an oligo-T nucleotide sequence (at the 5′ end) linked to a nucleotide sequence VNx (at the 3′ end) where the V nucleotide immediately adjacent to the oligo-T segment is not a T.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. patent application Ser. No. 10/002,536, filed on Nov. 1, 2001, which claims priority based on U.S. provisional patent application No. 60/244,933 filed Nov. 1, 2000.

FIELD OF THE INVENTION

The present invention relates to the fields of genomic and proteomic analysis. In particular, the present invention relates to the field of gene expression analysis.

BACKGROUND OF THE INVENTION

With progress in sequencing many genomes, including among them the human genome, there is additional interest in understanding the significance of changes in gene expression. The ability to correlate changes in gene expression, for example, with specific treatments and phenotypes in clinical and non-clinical biological systems, allows scientists to understand the underlying cell biology and identify the roles of specific genes, receptors and signaling pathways. One objective, among many, of this research is to identify specific genes that may serve, for example, as biomarkers for disease progression or diagnostic criteria, as well as to identify gene expression products (e.g., proteins) that can be targeted as or by new therapeutic compounds in order to study, diagnose, prevent or cure disease.

There have been significant advancements in the human genome-sequencing project and in similar sequencing efforts that involve organisms of interest to basic and preclinical research, genetics, and agronomics. This progress has generated and continues to generate deoxyribonucleic acid (“DNA”) and ribonucleic acid (“RNA”) sequence databases that serve as informational resources and support advancement in the methods by which genomic and proteomic research is carried out.

However, current genomic tools and techniques continue to require significant known genomic sequence information for the organism or tissue under investigation, or require that the investigators derive libraries of clones from the particular organism or tissue. In order to build a DNA microarray that represents essentially all genes for a particular species under investigation (e.g., human), the investigating scientist must expend tremendous resources to identify all possible messenger RNAs (“mRNAs”) that may be present in the studied sample. For example, high-density DNA microarrays, using large numbers of known genes, are required to conduct mRNA expression profiling in such samples. By comparison, use of low-density DNA microarrays creates a higher probability of “missing” genes (by omission from the array) that may be relevant to a given experimental paradigm.

Alternative methods, such as differential display and serial analysis of gene expression, may permit detection of differences in mRNA species between or among RNA samples. However, these methods also require significant resources to identify expressed genes and related expression products, such as mRNAs. For example, in order to identify differences in specific genes using differential display, segregated bands must be removed (excised) from an electrophoresis gel, amplified using polymerase chain reaction (“PCR”) techniques, and then sequenced. Similarly, serial analysis of gene expression (“SAGE”) requires significant sequencing resources to identify any differences in known and unknown genes.

The present invention addresses limitations in the prior art by comprising compositions and systems that incorporate novel strategies whereby molecular or biochemical assay compositions and systems are linked to DNA or RNA sequence databases for optimal resource efficiency in assaying gene expression.

SUMMARY OF THE INVENTION

The invention comprises compositions and systems to identify any or all genes expressed in a given in vivo or in vitro RNA sample, as well as the relative differences in mRNA between two or more samples, where desired. Furthermore, the invention comprises compositions and systems to identify novel genes (expressed as mRNA), by way of one example only, by detecting mRNA 3′ end fragments that do not correspond to any known sequence. Thus, embodiments of the invention (1) may identify any or all genes (mRNAs) expressed in a given eukaryotic sample, (2) support discovery of novel genes, or (3) may identify mRNAs that are expressed at different levels between two or more samples. The invention further comprises custom microarray design and production where genes shown to be present and/or differentially regulated between two samples can be used to produce project- or disease-specific microarrays that detect the genes of interest. Moreover, embodiments of the invention comprise systems of nucleic acid fragment collection and analysis.

The invention comprises compositions and systems to identify the relative expression level of any or all eukaryotic mRNAs in one or more samples. The invention comprises, without limitation, one or more mRNA specific primers for use in reverse transcription that themselves comprise an oligo-dT nucleotide sequence (at the 5′ end) linked to a nucleotide sequence (at the 3′ end) where the nucleotide immediately adjacent to the oligo-dT segment is not a T. This sequence can be written (from 5′ to 3′ end) as Tn-VNx, where n=any integer of 8 or greater describing how many T nucleotides are present; V=nucleotides A, G, or C; each N=nucleotides A, G, C, or T, and x=any integer 3 or greater that describes how many N nucleotides are present (SEQ ID NO: 1). (For purposes of the invention, the designation “d”, or “deoxy”, shall also include the “nondeoxy” form where appropriate as known by those of ordinary skill). The complete primer (oligo-dT region+VNx sequence) of the invention is called an “identimer.”
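By way of illustration only, the Tn-VNx definition above can be expressed as a pattern check. The following Python sketch is an assumption for explanatory purposes and is not part of the described compositions; the names `IDENTIMER_PATTERN` and `is_identimer` are hypothetical.

```python
import re

# Tn-VNx written 5' to 3': at least 8 T's (n >= 8), one non-T base (V),
# then three or more bases of any kind (x >= 3).
IDENTIMER_PATTERN = re.compile(r"T{8,}[ACG][ACGT]{3,}")

def is_identimer(seq):
    """Return True if seq fits the Tn-VNx form described in the text."""
    return IDENTIMER_PATTERN.fullmatch(seq) is not None

print(is_identimer("T" * 20 + "AAAC"))  # a 20-T primer anchored with V=A, NNN=AAC
print(is_identimer("T" * 20 + "AAC"))   # only two N bases after V
```

Because N may itself be a T, a string can sometimes be read with more than one value of n; the check only asks whether at least one valid reading exists.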

Embodiments of the invention may employ different or every combination of the Tn-VNx sequence. The 5′ end of the identimer comprises a reporter marker or molecule, by way of example only and without limitation, a fluorescent molecule, which allows detection of the resulting fragment. The Tn-VNx sequence of the invention includes priming of all genes containing the complementary sequence at the 3′ end (immediately adjacent to the poly-adenylation tail), providing information about the gene's identity (i.e. the mRNA's sequence at the 3′ end), limiting the number of reaction products to a useful number, and enabling detection of the resulting 3′ end fragment, as one example only, by fluorescence detection.

After deriving double-stranded complementary DNA (“cDNA”) in the invention, one or more 3′ end fragments are generated by a sequence-specific cleavage of the double-stranded cDNA, for example, by restriction endonuclease cleavage or other sequence-specific cleaving agents known to those of ordinary skill. The invention generates 3′ end fragments of cDNA where the 5′ end is known (complementary to the VNx nucleotide sequence), the 3′ end is known (e.g., the restriction endonuclease recognition sequence), and the size is known (by way of one example only, from electrophoretic separation or reverse HPLC). In some embodiments, the resulting data is accumulated, analyzed, and stored in a database. The resulting data permits identification of expressed mRNA in a single sample, or, by comparing the abundance of fragments from a control sample to a test (or unknown) sample, the relative differences in expression of genes of interest may be determined among samples.

The invention allows research and clinical scientists to identify any or all genes expressed in a given in vivo or in vitro RNA sample, as well as the relative differences in mRNA between two or more samples, where desired. Furthermore, novel gene (mRNA) discovery is made possible since mRNA 3′ end fragments can be identified that do not correspond to any known sequence.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a schematic diagram of one embodiment of the invention which generates 3′ cDNA fragments for the assay of poly-adenylated mRNAs in eukaryotic samples.

FIG. 2 shows a comparative fingerprint analysis between PCR amplified samples by an existing method and by the invention using an intermediate linear amplification step.

FIG. 3 shows a reverse phase HPLC separation of control and treated samples showing a 3′ cDNA fragment of similar size but different abundance, indicating a difference in expression level for a specific mRNA between the samples.

FIG. 4 shows the identification of mRNA in a test sample from cDNA fragments.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

As is commonly known, transcription is the transfer of the genetic information from a host's archival copy of DNA to mRNA. In typical transcription, RNA polymerase binds to a particular region of archival DNA and begins to make a strand of mRNA with a base sequence complementary to the DNA template that is “downstream” of the RNA polymerase binding site. When this transcription is finished, the portion of the DNA that coded for a protein, i.e., a gene, is now represented by an mRNA molecule that can be used as a template for translation into gene products, such as proteins. Thus, levels of gene expression can be evaluated by identifying and characterizing the mRNA complement of a host.

Those of ordinary skill will also appreciate that, to further evaluate the levels of gene expression, polyA mRNA may be prepared from the desired organism or tissue, and the first strand of cDNA may be synthesized from the mRNA template using, as some examples only, an RNA-dependent DNA polymerase, such as reverse transcriptase, and an oligonucleotide primer. A second strand of cDNA may be synthesized by one of the methods known to one of ordinary skill in the art, for example, by replacement synthesis or by primed synthesis. The resulting cDNA strands are thus available for further analysis and characterization.

With this background in mind, the invention comprises an identimer with three or more nucleotides upstream of a poly-T tail, combined with a restriction enzyme that cleaves ds DNA in a sequence-specific fashion, to generate 3-prime end cDNA fragments of expressed genes. The expression level of a given gene is proportional to and correlates with the amount or abundance of the respective 3′ cDNA fragment level. Genes (expressed as mRNAs) are identified by combining the known sequence of 3 or more nucleotides immediately adjacent to the poly-A tail (complementary to the Nx base-anchored primer) (SEQ ID NO: 1), the specific DNA sequence recognized or cut by the restriction enzyme(s) employed, and the size of the 3′ fragment. The size of the 3′ fragment represents the distance between the Nx base-anchored priming (poly-adenylation site) and the nearest restriction enzyme cut site. The identity of the mRNA (gene) may be derived by searching an mRNA or DNA database for the nucleotide sequence that matches the N(x)-base priming site, the restriction enzyme cut site, and the distance between the priming site and the cut site. Ambiguous calls are avoided by repeating the protocol with one or more restriction enzymes that recognize or cut a different nucleic acid sequence.

In one embodiment, without limitation, the invention comprises up to 192 different identimers that represent all combinations of the primer designated as 5′ Tn-VNNN 3′ (n=an integer of preferably 21 representing the number of T's; V=nucleotides A, G, or C but not T; each N=nucleotide A, G, C, or T) (SEQ ID NO: 2), which is designed to identify the 4 nucleotides immediately adjacent to the poly-adenylation site in eukaryotic mRNA. The Tn or poly-dT in the identimer is designed to anneal to the poly-A tail in eukaryotic mRNA. In one embodiment, without limitation, more than one set of 192 identimers (the permutations of VNNN=3×4×4×4) is employed, and each set is used with a single RNA sample. These identimer sets, and thus the samples, are differentiated by adding a distinct, detectable molecular label or marker, by way of one example only, a fluorescent label, to the 5′ end of each identimer in a set, with all identimers within a set having a similar 5′ marker. The identimer is annealed to mRNA using buffer and temperature conditions that are known in the art for optimal sequence-specific priming, and reverse transcription is carried out. Second-strand synthesis is subsequently carried out to produce double-stranded (“ds”) cDNA that is amenable to restriction enzyme cleavage. Enzyme-mediated, sequence-specific cleavage is carried out, resulting in fragmented ds cDNA. For each set where different cleavage enzymes or agents are used, the invention will generate different 3′ end fragments for characterization. In this manner, the invention generates and analyzes cDNA fragments that are assayed for size (e.g., mobility in a gel) and amount.
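The count of 192 identimers follows from the three choices for V and the four choices for each of the three N positions. The enumeration can be sketched in Python as follows; this is a minimal illustration only, and the function name `identimer_set` and its parameters are hypothetical.

```python
from itertools import product

def identimer_set(n_t=21, x=3):
    """Return every Tn-VNx primer, written 5' to 3'.

    n_t is the number of T's (21 in the embodiment above) and x is the
    number of N positions that follow the single V position.
    """
    oligo_t = "T" * n_t
    primers = []
    for v in "ACG":                           # V is A, C, or G, never T
        for ns in product("ACGT", repeat=x):  # each N is any base
            primers.append(oligo_t + v + "".join(ns))
    return primers

primers = identimer_set()
print(len(primers))  # 3 * 4 * 4 * 4 = 192
```

Each returned sequence corresponds to one reaction (or one aliquot) in the embodiments described.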

In one embodiment, without limitation, the invention generates 3′ cDNA fragments where the 4 nucleotides immediately adjacent to the poly-adenylated tail (VNNN) are known, the sequence-specific location (i.e., restriction site) is known, and the size of the fragment is determined and used to establish the distance between the restriction site and the poly-adenylated tail. The amount of the cDNA fragment, as well as the sample it represents, is determined by assaying the signal intensity of the 5′ label on the employed identimer set by means known to those of ordinary skill in the art. Differences in the normalized intensity of the 5′ labels for a specific cDNA fragment indicate differences in the respective mRNA among or between the RNA sample sets under investigation. Information regarding the identity of an expressed mRNA is derived from knowing the 4 nucleotides immediately adjacent to the poly-adenylated tail, the nucleotide sequence for the restriction site employed, and the distance between the poly-adenylated tail and the restriction site nearest to the poly-adenylation site with respect to the primary nucleic acid sequence (i.e., the most 3′ restriction site in the mRNA sequence).

In another embodiment, without limitation, samples representing control and one or more experimental samples of mRNA may each be divided into 192 aliquots, with one identimer of the 192 identimers added to each respective aliquot. The respective identimer sets may contain different distinct, detectable molecular labels or markers on the 5′ end of each identimer in a set, with all identimers within a set having a similar 5′ label. Reverse transcription and second strand synthesis are similarly conducted according to means known to those of ordinary skill in the art to produce ds cDNA that is amenable to restriction enzyme cleavage. Enzyme-mediated, sequence-specific cleavage is carried out, resulting in fragmented ds cDNA. Comparison of fragmented ds cDNA corresponding to each respective identimer yields data showing the relative expression of the corresponding gene between control and test samples. One of ordinary skill will appreciate that other useful embodiments are apparent, as some examples only, by varying the type and number of markers in each identimer set, as well as by varying the sequence-specific cleavage enzymes used to produce fragmented ds cDNA.

In one embodiment, without limitation, the invention comprises a system that compares two or more RNA samples, (1) yielding cDNA fragments of essentially all mRNAs (genes) present in the samples, (2) identifying the subset of genes that are different between the samples, and (3) providing immediate sequence information to identify known genes from the cDNA fragments. This invention thus has significant advantages over existing art because prior sequence information or clone library construction is not needed to enable the assay, as in the case of DNA microarrays that require significant resources to produce the DNA microarray. Instead, the invention provides immediate sequence information, in addition to information concerning changes or differences in mRNA level, to determine mRNA expression level and mRNA identification in one assay. Furthermore, the invention generates cDNA fragments from all mRNAs present in the sample for subsequent investigation by common molecular biology techniques such as cloning, PCR, sequencing, etc. It is important to note that the invention does not require prior knowledge of the sequence of the genome of the organism under investigation and can be employed in organisms that lack significant genomic sequence information. Thus by identifying specific differences in mRNA levels, critical changes in gene activity relevant to the research or disease model under study can be derived where little or no genomic sequence information is available. In this paradigm, preliminary information can be utilized to direct selective cloning and sequencing efforts, as well as to produce a DNA microarray specific to the relevant mRNAs for a more exhaustive, high throughput study across many samples in the research or disease model under investigation.

The determination of the size and abundance of 3′ cDNA fragments requires labeling the fragments with a detectable marker entity or grouping. In one embodiment of the invention, without limitation, labeling the 3′ cDNA fragments involves fluorescence. By way of one example only, a sample derived from a biological reference or control group is labeled with one fluorescent molecule or group, and a sample derived from a biological test or study group is labeled with a different fluorescent molecule or group. The detection method of the invention comprises a fluorescence detector that can identify and distinguish the fluorescent molecules or groups employed during labeling. One method of labeling involves the addition of one or more fluorescent groups to the identimer, preferably at the 5′ end of the identimer to avoid interfering with reverse transcription priming. Another labeling method, among others, uses fluorescence-modified reverse transcription substrates (e.g., fluorescence-modified trinucleotides) that are added to the reverse transcription reaction and incorporated into the cDNA during reverse transcription.

FIG. 1 shows a molecular protocol of one embodiment of the invention to generate 3′ cDNA fragments for the assay of all poly-adenylated mRNAs in eukaryotic samples. The sample under investigation is divided into 192 aliquots, and first strand synthesis (reverse transcription) is carried out using all VNNN combinations of the identimer, followed by second strand synthesis. The ds cDNA is cleaved in a sequence-specific manner using a restriction enzyme that involves a 4-base recognition sequence (e.g., NlaIII). The resulting fragments are ligated to an adaptamer that contains one or more RNA polymerase promoter sites for subsequent in vitro transcription. The 3′ cDNA fragments are initially enriched using PCR, primed at the adaptamer and the poly-adenylation site (i.e. identimer), and subsequently employed as a template for in vitro transcription promoted within the adaptamer (e.g. T7 polymerase promoter in the ligated adaptamer). This results in an amplification of the sequence adjacent to, and downstream from, the RNA polymerase promoter sequence, which includes the restriction site and the poly-adenylation site. “Second round” first strand synthesis is carried out using a fluorescence-labeled primer (identimer) to enable the detection of all 3′ cDNA fragments for size and abundance (fluorescence label is denoted as an “*” at the 5′ end of the identimer in FIG. 1). The entire process is repeated using a different restriction enzyme that employs a different recognition sequence (e.g. MboI). Gene (mRNA) identification is made by collecting knowledge of the 4 nucleotides upstream of the poly-adenylation site (determined by identimer priming), the sequence of the restriction enzyme recognition site, and the size of the fragment that provides the distance between the poly-adenylation site and the proximal restriction site. This information is employed to search the known sequence database(s) to identify the mRNA(s) that match these criteria.

In some embodiments, the identification of a gene or mRNA utilizes information derived from the identimer sequence, the restriction enzyme recognition sequence(s), and the size of the resulting cDNA fragments. This information is then employed to search an mRNA sequence database to identify the specific genes or mRNAs in the samples under investigation. The data used to search the mRNA database are derived by means of the invention. The mRNA nucleotide sequence of the bases immediately adjacent to the poly-A tail is derived from knowledge of the complementary identimer sequence. For example, if the identimer for a given reaction has the sequence 5′-TTTTTTTTTTTTTTTTTTTTAAAC-3′ (SEQ ID NO: 3), then any mRNAs identified from this reaction will contain the sequence 5′-GTTTAAAAAAAAAAAAAAAAAAAA-3′ (SEQ ID NO: 4). Further information is derived from the determination of the length of labeled cDNA fragments and the restriction enzyme employed to generate the fragments. For example, if the first restriction digest of the identimer reaction above employs the restriction enzyme NlaIII, which cuts at the sequence 5′-CATG-3′, then a cDNA fragment that is 334 bases in length identifies an mRNA sequence that contains the 5′-CATG-3′ sequence 314 bases from the poly-adenylation site. This takes into account the 20 “T” bases on the identimer (i.e. 334−20=314). If the second restriction digest of the identimer reaction employs the restriction enzyme MboI, which cuts at the sequence 5′-GATC-3′, then a cDNA fragment that is 889 bases in length identifies an mRNA sequence that contains the 5′-GATC-3′ sequence 869 bases from the poly-adenylation sequence. Using this information to search an appropriate database, one can identify the mRNA as human precerebellin (GI# 180250), which matches the analytical data. If no matching mRNA is present in the database, then one can employ a similar bioinformatical strategy to predict the identity of the unknown mRNA or approximate the identity of mRNA or gene family involved. Similarly, if the samples are derived from an organism that lacks an adequate mRNA or gene sequence database, the mRNA is identified using the database from a closely related species.
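The two derivations in the preceding example, namely the mRNA 3′ end sequence implied by the identimer and the distance implied by a fragment length, can be reproduced with a short Python sketch. It is offered as an illustration only; the function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))

def site_distance(fragment_length, oligo_t_length):
    """Distance from the restriction site to the poly-adenylation site,
    obtained by subtracting the oligo-T portion of the identimer."""
    return fragment_length - oligo_t_length

identimer = "T" * 20 + "AAAC"         # SEQ ID NO: 3 in the text
print(reverse_complement(identimer))  # GTTT followed by 20 A's (SEQ ID NO: 4)
print(site_distance(334, 20))         # NlaIII fragment: 314
print(site_distance(889, 20))         # MboI fragment: 869
```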

In some embodiments, the invention comprises an in vitro transcription step for optimal sensitivity. Each reaction employs one or more primed reverse transcription steps resulting in an mRNA:cDNA hybrid. The mRNA:cDNA hybrid is converted to ds cDNA using a “second strand” reaction that may involve, as examples only, RNase H, a DNA polymerase, and a DNA ligase. Subsequently the ds cDNA is fragmented using one or more restriction enzymes or other cleaving agents that cleave DNA in a sequence-dependent fashion. Subsequently a specific DNA sequence, or “adaptamer”, containing one or more RNA polymerase promoters, is ligated onto the restriction site, allowing in vitro transcription-mediated amplification of the adaptamer-ligated fragments. The resulting RNA sequences include a 3′ poly-A tail and serve as a template for identimer-primed reverse transcription that produces 3′ cDNA fragments. These fragments, which may correspond to cleaved fragments, are flanked by known sequences from the identimer hybridization site and restriction site, and the fragments can be analyzed for size and abundance to determine the identity of the mRNA and the expression level in the sample, respectively.

In one embodiment of the invention, without limitation, identification of the gene associated with a given mRNA fragment, observed as a gel band or chromatographic mobility peak, is attained by an automated matching of the fragment length, restriction enzyme(s), and the known VNx recognition sequence to all predicted fragments obtained in a computational analysis of an mRNA database. A representative set of mRNA sequences for the organism under investigation is submitted to a computational analysis that, for each sequence, identifies the start of the poly-adenylation (polyA) site, the VNx sequence immediately upstream of the polyA site (recognition sequence), and the location of all restriction sites that would be subject to cleavage by the restriction enzymes used in the biochemical protocol. For each restriction enzyme used, the cleavage site upstream and proximal to the mRNA 3′ end is identified. The predicted fragment length is calculated by counting the number of nucleotides from the proximal cleavage site to the beginning of the polyA site and adding the number of nucleotides for the length of the oligo-T portion of the identimer primer. Thus, for each mRNA sequence in the representative database, an algorithm may predict the fragment lengths that would be observed if the given mRNA sequence were present in the sample and analyzed with the biochemical protocol. A computer application calculates all predicted fragments for a specified set of restriction enzymes and four-base recognition elements. To associate a gene with a fragment obtained in the biochemical protocol (the “target fragment”), the target fragment is compared to all predicted fragments. Predicted fragments matching the target fragment length, restriction site, and VNx recognition sequence are putative matches. An unambiguous identification is obtained when only one predicted fragment matches the target fragment. Gene identification is accomplished by referring to the sequence information associated with the predicted fragment. In the case where the target fragment matches multiple predicted fragments (multiple genes), the use of additional restriction enzymes provides an unambiguous identification. In such a case, identification requires that all predicted fragments for the putative gene must match target fragments observed in the experiment. In other words, each restriction enzyme may produce an observed fragment for the given gene; when each of these target fragments is matched to predicted fragments from a single gene then identification of the target fragments can be made.
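The fragment prediction and matching described above may be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the computer application of the invention: sequences are given in the mRNA sense using the DNA alphabet and end in their poly-A tail, distance is counted from the first base of the recognition site, the anchor is written in the mRNA sense (the reverse complement of VNx), and the function names, the toy database, and the `tolerance` parameter are all hypothetical.

```python
def polya_start(seq):
    """Index at which the trailing run of A's (the polyA site) begins."""
    return len(seq.rstrip("A"))

def predict_fragment(seq, site, oligo_t_len=20, x=3):
    """Return (anchor, site, predicted_length), or None if no fragment forms.

    anchor is the x+1 bases immediately upstream of the polyA site.
    Assumption: length is counted from the first base of the most 3'
    recognition site to the polyA start, plus the oligo-T length.
    """
    end = polya_start(seq)
    if end < x + 1:
        return None
    cut = seq.rfind(site, 0, end)  # most 3' site upstream of the polyA site
    if cut < 0:
        return None
    return (seq[end - (x + 1):end], site, end - cut + oligo_t_len)

def candidate_genes(database, anchor, observed, oligo_t_len=20, tolerance=0):
    """Names of sequences whose predictions match every observed fragment.

    observed maps each recognition sequence to the measured fragment length.
    """
    hits = []
    for name, seq in database.items():
        matched = True
        for site, length in observed.items():
            pred = predict_fragment(seq, site, oligo_t_len, len(anchor) - 1)
            if pred is None or pred[0] != anchor or abs(pred[2] - length) > tolerance:
                matched = False
                break
        if matched:
            hits.append(name)
    return hits

# Toy sequences, invented for illustration only.
database = {
    "gene_a": "GG" + "CATG" + "C" * 10 + "GATC" + "C" * 6 + "GTTT" + "A" * 25,
    "gene_b": "CATG" + "C" * 20 + "GTTT" + "A" * 25,
}
print(candidate_genes(database, "GTTT", {"CATG": 48}))              # ambiguous
print(candidate_genes(database, "GTTT", {"CATG": 48, "GATC": 34}))  # resolved
```

In the toy data, one enzyme alone leaves two putative matches, and adding the second enzyme leaves a single gene, mirroring the disambiguation step described in the text. A nonzero `tolerance` would allow for sizing error in the measured lengths.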

In some embodiments, the invention comprises providing a transcript (mRNA) sequence database for the organism under investigation, as well as an executable program to mine and match the database with the 3′ cDNA fragments.

Determining relative differences between two different RNA samples involves comparing the abundance of all 3′ fragments, which have been differentially labeled for detection. For example, a comparative increase in the signal intensity of a specific 3′ cDNA fragment(s) indicates the mRNA that gave rise to the fragment(s) is more abundant in the respective sample. Furthermore, the appearance or disappearance of a specific 3′ cDNA fragment(s) indicates induction or repression, respectively, of the mRNA that gave rise to the fragment(s).
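As one illustration only, the comparison of fragment abundance between a control and a treated sample can be sketched in Python. The sketch assumes that signal intensities have already been normalized between the two labels, that fragments are keyed by their measured length, and that a twofold ratio is used as the cutoff; the cutoff, the function name, and the example intensities are hypothetical and are not taken from the specification.

```python
def compare_samples(control, treated, min_ratio=2.0):
    """Classify each fragment by its relative abundance in two samples.

    control and treated map fragment length to normalized signal intensity.
    min_ratio is an assumed cutoff, not a value from the specification.
    """
    calls = {}
    for frag in sorted(set(control) | set(treated)):
        c = control.get(frag, 0.0)
        t = treated.get(frag, 0.0)
        if c == 0 and t == 0:
            continue                       # no signal in either sample
        if c == 0:
            calls[frag] = "appeared"       # suggests induction
        elif t == 0:
            calls[frag] = "disappeared"    # suggests repression
        elif t / c >= min_ratio:
            calls[frag] = "higher in treated"
        elif c / t >= min_ratio:
            calls[frag] = "higher in control"
        else:
            calls[frag] = "unchanged"
    return calls

# Invented intensities for illustration only.
control = {314: 100.0, 410: 50.0, 520: 80.0}
treated = {314: 300.0, 410: 55.0, 600: 40.0}
for length, call in compare_samples(control, treated).items():
    print(length, call)
```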

EXAMPLES

The following examples illustrate embodiments of the invention but in no way restrict the overall scope of the invention to only those described below.

Example 1

First and second strand cDNA synthesis. First strand synthesis is performed by means known to those of ordinary skill (using any experimental cell/tissue type) on the total RNA population utilizing a four-base identimer of sequence NNNVT₂₁, where each N=A, C, T, or G, and V=A, C, or G but not T (SEQ ID NO: 2). In practical application, the total number of unique identimer tags theoretically required to span the total estimated mRNA population (in a eukaryotic organism) would be 192 (thus 192 unique subsets). Compared with most differential display protocols, which typically utilize a two-base anchored primer for first strand synthesis, a four-base identimer has advantages by: (1) significantly reducing the complexity of the mRNA pool by a factor of 16 (192/12=˜16), thereby reducing the number of bands displayed per fingerprint (or subset); (2) providing a more accurate prediction of the candidate mRNA(s) of interest through the additional two nucleotide sequence at the 3′-end of each mRNA preceding the poly-A (along with restriction site); and (3) allowing for more stringent annealing temperature (e.g., 50° C.), thereby reducing potential mispriming during first strand synthesis. Following first strand synthesis, ds cDNA 5′ was synthesized using a cocktail of requisite enzymes (DNA Polymerase I, RNase H, and E. coli DNA ligase), for example, according to the method of Gubler and Hoffman (Life Technologies Instruction and Technical manuals).

Restriction enzyme digestion. Following second strand synthesis, ds cDNAs are digested separately with any four-base recognition sequence-specific class IIS restriction enzymes, yielding a 4-base cohesive or recessive end, ideal for improving the efficiency of subsequent ligation reactions. Assuming the average size of a synthesized cDNA is approximately 1200 base pairs (“bp”), the 4-base recognition enzyme will cut once every 256 bp (on average), generating 5 different fragments (on average). However, the fragment of interest is the most 3′ fragment, which is selected for in subsequent steps.
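The 256 bp figure follows from the number of possible four-base sequences (4×4×4×4), assuming the four bases occur with equal frequency and independently of one another. A brief Python sketch of this arithmetic is given for illustration only; the function name is hypothetical.

```python
def expected_cutting(cdna_length=1200, site_length=4):
    """Average spacing between sites and expected sites per cDNA.

    Assumes equal, independent base frequencies, so a given site of
    site_length bases occurs on average once every 4**site_length bases.
    """
    spacing = 4 ** site_length
    sites_per_cdna = cdna_length / spacing
    return spacing, sites_per_cdna

spacing, sites = expected_cutting()
print(spacing)          # 256
print(round(sites, 1))  # 4.7, which the text rounds to about 5
```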

RNA polymerase-specific adaptor ligations. An extended ds promoter recognition sequence specific for the respective RNA polymerase of choice (e.g. T7) is ligated (using a standard T4 DNA ligase protocol) to the 5′ protruding region at the 3′ end of each ds cDNA containing the four-base cut site. The RNA polymerase adaptor has an extended complementary sequence to the cohesive or recessive end generated by each respective enzyme. All cDNAs now acquire a universal promoter site (and primer site) specific for the RNA polymerase employed.

Selective PCR amplification of 3′-cDNA fragments. Ligation products are subjected to a selective PCR amplification of the representative 3′-cDNA pools by known means, using a sense strand specific primer derived from the RNA polymerase-specific ds adaptor and each individual identimer tag(s) originally used for first strand synthesis. This step selectively amplifies the most 3′ ds fragments that are flanked by the adaptor sequence (“forward primer annealing site”) and the poly-A site (“reverse primer” or “identimer annealing” site). PCR amplifications were performed using a HotStar™ Taq master mix (Qiagen, Valencia, Calif., USA) in a 30 μl reaction format as follows: 2.5 U HotStar Taq™ DNA Polymerase (Qiagen); 1×PCR buffer; 200 μM dNTPs; 0.02 μM identimer tag; 1.0 μM T3 sense primer. The PCR conditions were as follows: initial activation and denaturing step at 95° C. for 15 min followed by 25 cycles at 94° C. for 30 sec, 50° C. for 1 min, 72° C. for 1 min with a final extension step for 5 min at 72° C. 3′ PCR fragments are purified from residual reaction components, quantified, and used for subsequent reactions.

Selective linear amplification of 3′ PCR pools. Another drawback to mostcurrent differential display-derived methods is the underrepresentationof the mRNA pool due to differential display's preferential bias towardhigh copy number mRNAs. Given that the percentage of low to rareabundance mRNAs can comprise up to 90% of the total mRNA pool,alternative or extended strategies (following PCR) can be employed toobtain more accurate representation of all expressed mRNAs. This entailsa linear amplification event (i.e., in vitro transcription) following anexponential amplification event (PCR) as a non-biased approach tofurther analyzing those messages that otherwise would not be observedfollowing PCR.

Each PCR reaction (containing a representative 3′ pool based on theidentimer used) is subjected to an in vitro transcription reaction usingthe RNA polymerase of interest for the appropriate time interval usingthe following reaction components (final concentration): 1-2 μg templateDNA; 7.5 mM of each individual nucleotide; 10 mM DTT; 1× reactionbuffer; RNA polymerase. Following linear amplification, in vitrotranscribed mRNAs are purified, quantified, and used for a second-roundfirst strand synthesis, respective to the first step in this overallmethod.

FIG. 2 shows a comparative fingerprint analysis between PCR amplified samples using an existing approach and the invention's intermediate linear amplification step. The samples are: Lane 1, 100 bp ladder; Lane 2, 2 μg ‘control’ PCR fingerprint; Lane 3, 2 μg ‘treated’ PCR fingerprint; Lane 4, 2 μg ‘control’ RT product; Lane 5, 2 μg ‘treated’ RT product. Control and treated samples represent total RNA samples from untreated HEK293 cell cultures and from HEK293 cell cultures undergoing etoposide-induced apoptosis, respectively. The arrows in FIG. 2 indicate unique bands displayed using the invention's linear amplification step for comparing expression profiles, which would otherwise not be detected using current art. Differences in migration patterns between PCR fingerprints and RT fingerprints are attributed to the RNA polymerase recognition site ligated into the fragments, which is not transcribed during in vitro transcription in the system of the invention.

Second-round first strand synthesis and display of fingerprints.

In vitro transcribed mRNAs are subjected to a second-round first strand synthesis reaction by ordinary means in order to generate a double-stranded mRNA:cDNA duplex. Experimental samples are end-labeled using 5′-fluorescence-labeled identimer tags (i.e., control sample, cy-3; treated sample, cy-5). Following synthesis, 3′ fingerprints are analyzed for differences in expression levels using denaturing high-performance liquid chromatography (DHPLC) or gel electrophoresis. All detectable fragments are analyzed for abundance and subsequently employed for gene (mRNA) identification by a bioinformatic method that employs the identimer sequence, the restriction site, and the length of the fragments. Fragments can be collected for subsequent applications or investigation such as DNA microarray production, sequencing, cloning, etc.

FIG. 3 shows an example of reverse phase HPLC separation of control and treated samples. The results show a 3′ cDNA fragment (arrows) of similar size but different abundance, indicating a difference in expression level for a specific mRNA between the samples. Note that all other peaks (bands) in the trace co-migrate and are at the same abundance (based on peak height), indicating that these 3′ cDNA fragments are derived from mRNAs that are present in both samples at the same expression level. Control and treated samples represent total RNA samples from untreated HEK293 cell cultures and from HEK293 cell cultures undergoing etoposide-induced apoptosis, respectively.

Example 2

In one embodiment, the invention is a system whereby the identity and relative expression level of all eukaryotic mRNAs (messenger RNAs) are determined. Some components of the invention include, without limitation, (1) primer design & reverse transcription, (2) production of double-stranded cDNA, (3) sequence-specific cleavage of ds cDNA, and (4) fragment detection & analysis.

Primer Design. The invention takes advantage of the polyadenylation of eukaryotic mRNAs by utilizing an anchored oligo-T primer. The basic primer design includes an oligo-T nucleotide sequence (5′ end) linked to a 5-nucleotide sequence (3′ end) where the base immediately adjacent to the oligo-T stretch is not a T. This sequence can be written (5′ to 3′) as Tn-VNNNN (n = any single integer of 8 or greater, representing how many T bases are present). The 5′ end of the primer contains one or more reporter molecules or markers (e.g. fluorescent molecule, hapten, biotin, radioisotope, etc.) that allow for detection of the resulting fragment (e.g. size determination) and enable collection of the resulting fragment if desired. Every combination of the VNNNN sequence is employed (in this case, 3 × 4^4 = 768 combinations), and occasional modifications in length are utilized to accommodate common vs. rare 3′ end sequences. The 768 combinations (i.e. reactions) are managed by employing multi-well plates (e.g. 384-well plate) and multiple reporter molecules (e.g. two reactions per well in a 384-well plate using two different fluorescent reporter molecules). The Tn-VNNNN sequence serves three purposes: priming all genes containing the complementary sequence at the 3′ end (immediately adjacent to the poly-adenylation tail), providing information about the gene's identity (i.e. the mRNA's sequence at the 3′ end), and enabling detection of the resulting 3′ end fragment (e.g. fluorescence detection). The VNNNN sequence primer (including the oligo-T region) is called the “identimer”, and the complementary sequence (i.e. mRNA sequence) to the 5 (non-oligo-T region) nucleotides of the identimer sequence is called the “identifier” sequence (which utilizes DNA sequence nomenclature rather than RNA; T rather than U). The “identifier” sequence is used in identifying the specific mRNA in a sequence database, along with the 3′ fragment size and restriction enzyme recognition sequence.
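
The primer design above can be illustrated with a short Python sketch. It is not part of the disclosed system: the function names, the oligo-T length of 20, and the reporter labels "dye_1" and "dye_2" are assumptions chosen for illustration only. It enumerates the VNNNN tails, derives the corresponding identifier (the reverse complement of the tail, in DNA notation), and assigns two reactions per well of a 384-well plate.

```python
from itertools import product

V_BASES = "ACG"   # V: any base except T
N_BASES = "ACGT"  # N: any base
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def identimer_tails(x=4):
    """Return every V + N*x tail, written 5' to 3' (3 * 4**x sequences)."""
    return [v + "".join(ns)
            for v in V_BASES
            for ns in product(N_BASES, repeat=x)]


def identimer(tail, n_t=20):
    """Full primer, 5' to 3': oligo-T stretch followed by the tail.

    n_t=20 is an assumed oligo-T length; the text only requires n >= 8.
    """
    return "T" * n_t + tail


def identifier(tail):
    """mRNA-sense sequence (DNA notation) that the tail anneals to,
    i.e. the reverse complement of the tail."""
    return tail.translate(COMPLEMENT)[::-1]


def plate_layout(tails, reporters=("dye_1", "dye_2"), wells=384):
    """Map each tail to a (well index, reporter) pair.

    The reporter names are hypothetical placeholders for two
    distinguishable fluorescent labels.
    """
    if len(tails) > wells * len(reporters):
        raise ValueError("not enough well/reporter combinations")
    return {tail: (i % wells, reporters[i // wells])
            for i, tail in enumerate(tails)}
```

Under these assumptions, `len(identimer_tails())` is 768, and the tail GGTTT corresponds to the identifier AAACC, which is the sequence expected immediately upstream of the poly-A tail of a matching mRNA.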

Reverse Transcription. Reverse transcription (“RT”) is a molecular biology protocol known to those of ordinary skill that allows a complementary DNA sequence to be synthesized using an RNA template. Many enzymes are available to carry out this reaction, which involves adding nucleotides to the 3′ end of the primer or growing DNA strand. The result of RT is an RNA-DNA heteroduplex. Since the identimer primer utilizes the poly-A tail on eukaryotic mRNAs, almost all mRNAs can be employed for RT. The basic premise involves annealing the identimer primer to the RNA, which enables RT.

Production of Double-Stranded cDNA. Once the RT reaction is complete, ds-cDNA is generated by any method utilized by those of ordinary skill in the art. Such a protocol may employ, for example, RNase H to produce nicks and gaps in the RNA strand of the RNA-DNA heteroduplex while simultaneously employing DNA polymerase I, as well as DNA ligase, to replace the RNA strand with “second strand” DNA.

Sequence-Specific Cleavage of Double-Stranded cDNA. After deriving ds cDNA, one or more 3′ end fragments are generated by a sequence-specific cleavage of the ds DNA. One embodiment, among others, would involve using a restriction enzyme (“RE”) that has a 4-base recognition sequence. The average length of the cut fragments may be estimated at 256 bases using a RE with a 4-base recognition sequence. Furthermore, the RE recognition sequence provides the 5′ end sequence information on the fragment generated (i.e. the RE recognition sequence itself). The resulting fragments can be ligated to a common primer for amplification if desired.
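
The 256-base estimate follows from a simple probability argument, sketched below in Python. The sketch assumes that bases occur independently and, by default, with equal frequency (0.25 each); real cDNA sequences deviate from this, so the figure is only a rough guide. The function name is hypothetical.

```python
from math import prod


def expected_fragment_length(site, base_freq=None):
    """Mean spacing between occurrences of a recognition site.

    Assumes independent bases. With equal base frequencies a 4-base
    site occurs at a given position with probability (1/4)**4, so the
    mean spacing is 4**4 = 256 bases.
    """
    if base_freq is None:
        base_freq = {b: 0.25 for b in "ACGT"}
    probability = prod(base_freq[b] for b in site)
    return 1 / probability
```

For the 4-base site CATG this gives 256 under the equal-frequency assumption, while a 6-base site would give 4096, which is one reason a 4-base cutter is convenient for producing 3′ fragments of resolvable size.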

Fragment Detection & Analysis. The previous steps generate 3′ end fragments where the identimer sequence and the RE recognition sequence are known. Determining the size of the fragment in the system (as one example, by capillary electrophoresis) provides information about the location of the RE recognition sequence and enables analysis of the fragment information with the appropriate database. This creates a novel and desirable data set describing the mRNA expression profile(s) of RNA isolated from any eukaryotic sample or model system. Furthermore, collection of these fragments for sequencing may be utilized for identifying novel sequences.
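
One way such a database comparison might be sketched is shown below. This is an illustrative simplification, not the bioinformatic method of the invention: the function names are hypothetical, transcripts are assumed to be supplied in DNA notation (5′ to 3′) without their poly-A tails, the predicted size is measured from the start of the 3′-most recognition site to the end of the primer, and the adaptor length and exact cut position within the site are ignored.

```python
def predict_fragment_size(mrna, identifier, site="CATG", oligo_t_len=20):
    """Predict the 3' fragment size for one transcript, or return None.

    A transcript is a candidate only if it ends with the identifier
    (the bases immediately upstream of the poly-A tail) and contains
    the recognition site. oligo_t_len is an assumed primer oligo-T
    length added to the transcript-derived portion of the fragment.
    """
    if not mrna.endswith(identifier):
        return None
    pos = mrna.rfind(site)  # 3'-most occurrence of the site
    if pos == -1:
        return None
    return len(mrna) - pos + oligo_t_len


def match_bands(transcripts, identifier, observed_sizes,
                site="CATG", tolerance=5, oligo_t_len=20):
    """Return {observed size: [(transcript name, predicted size), ...]}.

    tolerance (in bases) is an assumed allowance for sizing error.
    """
    hits = {size: [] for size in observed_sizes}
    for name, seq in transcripts.items():
        predicted = predict_fragment_size(seq, identifier, site, oligo_t_len)
        if predicted is None:
            continue
        for size in observed_sizes:
            if abs(predicted - size) <= tolerance:
                hits[size].append((name, predicted))
    return hits
```

An observed size that returns no match in such a search would be a candidate for a novel sequence, consistent with the use of fragment collection and sequencing described above.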

FIG. 4 shows the identification of mRNA in an HEK 293 test sample from cDNA fragments. The identity of the specific mRNAs was established by combining the specific VNNNN identimer used in the experiment (5′ TTTTTTTTTTTTTTTTTTTTGGTTT 3′ (SEQ ID NO: 5)), the specific restriction enzyme employed (NlaIII, cut site is 5′ CATG 3′), and the size of the fragments from the gel. Using this information, a search of publicly available mRNA and DNA sequence databases produced the results for the samples shown in Table 1:

TABLE 1

Band   Size (bp)   GenBank ID   Gene/mRNA Description
1      380         5102748      Homo sapiens mRNA full length insert cDNA clone EUROIMAGE 35971
2      315         180250       Human precerebellin and cerebellin mRNA, complete cds
3      275         5102752      Homo sapiens mRNA full length cDNA clone EUROIMAGE 609395
4      250         No ID        No matching mRNA sequence
5      220         No ID        No matching mRNA sequence
6      200         No ID        No matching mRNA sequence
7      155         6807752      Homo sapiens mRNA: cDNA DKFZp434L1016
8      120         3366801      Homo sapiens orphan G protein-coupled receptor HG38 mRNA, complete cds

Preferred embodiments of the present invention have been disclosed. A person of ordinary skill in the art would realize, however, that certain modifications would come within the teachings of the invention, and the following claims should be studied to determine the true scope and content of the invention. In addition, the methods and structures of the present invention can be incorporated in the form of a variety of embodiments, only a few of which are described herein. It will be apparent to the artisan that other embodiments exist that do not depart from the spirit of the invention. Thus, the described embodiments are illustrative and should not be construed as restrictive.

1-33. (canceled)
34. A system for identification and characterization of gene expression in one or more samples, comprised of: (a) providing one or more samples comprising one or more mRNA molecules; (b) providing an identimer comprising an oligo-dT primer of sequence, from 5′ to 3′ end, of Tn-VNx, where n is an integer 8 or greater but not more than 50 representing the number of T's, V equals a nucleotide A, C, or G but not T, each N equals a nucleotide A, C, G, or T, and x is an integer 3 or greater but not more than 10 representing the number of N nucleotides, said identimer also comprising a detectable marker at its 5′ end; (c) contacting said mRNA with said identimer such that the polyT portion of the identimer hybridizes to the polyA tail of the mRNA and the VNx portion of the identimer hybridizes with portions of the mRNA immediately upstream of the polyA tail; (d) reverse transcribing the mRNA to produce a first strand cDNA that includes the identimer; (e) synthesizing a second DNA strand complementary to the first strand cDNA to form a duplex; (f) cleaving the duplex with at least one sequence-specific cleaving agent to provide one or more duplex cleavage fragments; (g) ligating an adaptamer comprising an RNA polymerase promoter site to one or more of said cleavage fragments; (h) amplifying the one or more ligated cleavage fragments using the identimer to produce one or more amplified fragments comprising sequences complementary to a 3′ end of the mRNA; (i) identifying and characterizing the cleavage fragments according to the presence of the marker, the sequences corresponding to the VNx nucleotide sequence and the sequence associated with the sequence-specific cleaving agent, and the size of the fragment; and (j) identifying any gene associated with the cleavage fragments by comparing the sequence and size characteristics of the cleavage fragment with a database containing sequence and size characteristics of RNAs associated with known genes, whereby said comparison is conducted by means of software operated on a microprocessor, where said detectable marker is comprised of a fluorescent molecule, a hapten, biotin, a radioisotope, or any combination thereof.